Problem definition: We are given 2013 sales data for 1559 products across 10
stores in different cities. The aim of this data exploration is to identify the
attributes that have a statistically significant impact on the sales of each
product at a particular store.

========================================================

Explore the data a bit

##  [1] "Item_Identifier"           "Item_Weight"              
##  [3] "Item_Fat_Content"          "Item_Visibility"          
##  [5] "Item_Type"                 "Item_MRP"                 
##  [7] "Outlet_Identifier"         "Outlet_Establishment_Year"
##  [9] "Outlet_Size"               "Outlet_Location_Type"     
## [11] "Outlet_Type"               "Item_Outlet_Sales"

Dimension of data

## [1] 8523   12

There are 8523 observations in total, each described by 12 attributes.

Structure of data

## 'data.frame':    8523 obs. of  12 variables:
##  $ Item_Identifier          : Factor w/ 1559 levels "DRA12","DRA24",..: 157 9 663 1122 1298 759 697 739 441 991 ...
##  $ Item_Weight              : num  9.3 5.92 17.5 19.2 8.93 ...
##  $ Item_Fat_Content         : Factor w/ 5 levels "LF","low fat",..: 3 5 3 5 3 5 5 3 5 5 ...
##  $ Item_Visibility          : num  0.016 0.0193 0.0168 0 0 ...
##  $ Item_Type                : Factor w/ 16 levels "Baking Goods",..: 5 15 11 7 10 1 14 14 6 6 ...
##  $ Item_MRP                 : num  249.8 48.3 141.6 182.1 53.9 ...
##  $ Outlet_Identifier        : Factor w/ 10 levels "OUT010","OUT013",..: 10 4 10 1 2 4 2 6 8 3 ...
##  $ Outlet_Establishment_Year: int  1999 2009 1999 1998 1987 2009 1987 1985 2002 2007 ...
##  $ Outlet_Size              : Factor w/ 4 levels "","High","Medium",..: 3 3 3 1 2 3 2 3 1 1 ...
##  $ Outlet_Location_Type     : Factor w/ 3 levels "Tier 1","Tier 2",..: 1 3 1 3 3 3 3 3 2 2 ...
##  $ Outlet_Type              : Factor w/ 4 levels "Grocery Store",..: 2 3 2 1 2 3 2 4 2 2 ...
##  $ Item_Outlet_Sales        : num  3735 443 2097 732 995 ...

Find any entries that are blank and set them to NA. Item_Weight has missing
values, and Outlet_Size and Item_Visibility have entries that are empty or zero.
An Item_Visibility of zero would mean the product is not displayed in the outlet
at all, which makes no sense for a product that is sold there, so these entries
are treated as missing and imputed.
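
The conversion of blank and zero entries to NA can be sketched in base R as follows. This is a minimal illustration on a made-up toy data frame (the values and the names `raw`, `clean` and `miss` are assumptions, not the original code); only the column names follow the data set.

```r
# Toy stand-in for the raw data: column names follow the data set,
# values are invented for illustration.
raw <- data.frame(
  Item_Weight     = c(9.3, NA, 17.5, 19.2),
  Item_Visibility = c(0.016, 0.019, 0, 0),
  Outlet_Size     = factor(c("Medium", "", "High", ""))
)

clean <- raw
# Blank factor entries become NA; droplevels() removes the now-unused "" level
clean$Outlet_Size[clean$Outlet_Size == ""] <- NA
clean$Outlet_Size <- droplevels(clean$Outlet_Size)
# A visibility of exactly zero is treated as missing
clean$Item_Visibility[clean$Item_Visibility == 0] <- NA

# Share of missing values per column, largest first
miss <- sort(colMeans(is.na(clean)), decreasing = TRUE)
print(miss)
```

Sorting the per-column share of NA values gives the same kind of ranking as the "Variables sorted by number of missings" output.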

Looking at the missing data pattern visually

The following figure is useful for investigating any structure in the missing
observations. The missing pattern can also suggest which variables could be
useful for imputing the missing entries.

md.pattern(bigmarts_train) 
##      Item_Identifier Item_Fat_Content Item_Type Item_MRP Outlet_Identifier
## 4358               1                1         1        1                 1
## 1373               1                1         1        1                 1
##  292               1                1         1        1                 1
## 2266               1                1         1        1                 1
##   90               1                1         1        1                 1
##  144               1                1         1        1                 1
##                    0                0         0        0                 0
##      Outlet_Establishment_Year Outlet_Location_Type Outlet_Type
## 4358                         1                    1           1
## 1373                         1                    1           1
##  292                         1                    1           1
## 2266                         1                    1           1
##   90                         1                    1           1
##  144                         1                    1           1
##                              0                    0           0
##      Item_Outlet_Sales Item_Visibility Item_Weight Outlet_Size     
## 4358                 1               1           1           1    0
## 1373                 1               1           0           1    1
##  292                 1               0           1           1    1
## 2266                 1               1           1           0    1
##   90                 1               0           0           1    2
##  144                 1               0           1           0    2
##                      0             526        1463        2410 4399

## 
##  Variables sorted by number of missings: 
##                   Variable      Count
##                Outlet_Size 0.28276428
##                Item_Weight 0.17165317
##            Item_Visibility 0.06171536
##            Item_Identifier 0.00000000
##           Item_Fat_Content 0.00000000
##                  Item_Type 0.00000000
##                   Item_MRP 0.00000000
##          Outlet_Identifier 0.00000000
##  Outlet_Establishment_Year 0.00000000
##       Outlet_Location_Type 0.00000000
##                Outlet_Type 0.00000000
##          Item_Outlet_Sales 0.00000000
## [1] 526

In the histogram on the left, the variables are sorted by their share of
missing values. Outlet_Size, Item_Weight and Item_Visibility are the variables
with the most missing values.
In the graph on the right, blue represents observed data and red represents
missing data.

Cleaning the data: using statistical imputation techniques

Imputation is the process of replacing missing data with substituted values.
The method used is predictive mean matching (pmm). Some Item_Visibility values
are zero, which is not plausible, and Item_Weight and Outlet_Size have missing
values, so all three need to be fixed by imputation.
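
To show the idea behind predictive mean matching, here is a simplified single-imputation sketch in base R. It is not the procedure used in this analysis: the function `pmm_impute`, the toy data and the choice of Item_MRP as the only predictor are assumptions made for illustration, and full implementations such as the one in the mice package also add random variation to the regression parameters.

```r
# Simplified predictive mean matching (single imputation, base R only).
set.seed(1)
toy <- data.frame(
  Item_MRP    = c(50, 60, 120, 130, 200, 210, 90, 180),
  Item_Weight = c(6.0, NA, 12.5, 13.0, NA, 19.0, 9.5, 17.0)
)

# Hypothetical helper: impute `target` from `predictor` using k donors
pmm_impute <- function(df, target, predictor, k = 3) {
  obs  <- !is.na(df[[target]])
  # Regression fitted on the rows where the target is observed
  fit  <- lm(reformulate(predictor, response = target), data = df[obs, ])
  pred <- predict(fit, newdata = df)
  for (i in which(!obs)) {
    # The k observed rows whose predicted value is closest to row i's prediction
    donors <- which(obs)[order(abs(pred[obs] - pred[i]))[seq_len(k)]]
    # Copy the observed value of one randomly chosen donor
    df[[target]][i] <- df[[target]][donors[sample.int(length(donors), 1)]]
  }
  df
}

toy_imp <- pmm_impute(toy, "Item_Weight", "Item_MRP")
print(toy_imp)
```

Because every imputed value is copied from an observed donor row, the imputed values always stay within the range of the observed data.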

Display first 5 records

##   Item_Identifier Item_Weight Item_Fat_Content Item_Visibility
## 1           FDA15        9.30          Low Fat      0.01604730
## 2           DRC01        5.92          Regular      0.01927822
## 3           FDN15       17.50          Low Fat      0.01676007
## 4           FDX07       19.20          Regular              NA
## 5           NCD19        8.93          Low Fat              NA
##               Item_Type Item_MRP Outlet_Identifier
## 1                 Dairy 249.8092            OUT049
## 2           Soft Drinks  48.2692            OUT018
## 3                  Meat 141.6180            OUT049
## 4 Fruits and Vegetables 182.0950            OUT010
## 5             Household  53.8614            OUT013
##   Outlet_Establishment_Year Outlet_Size Outlet_Location_Type
## 1                      1999      Medium               Tier 1
## 2                      2009      Medium               Tier 3
## 3                      1999      Medium               Tier 1
## 4                      1998        <NA>               Tier 3
## 5                      1987        High               Tier 3
##         Outlet_Type Item_Outlet_Sales
## 1 Supermarket Type1         3735.1380
## 2 Supermarket Type2          443.4228
## 3 Supermarket Type1         2097.2700
## 4     Grocery Store          732.3800
## 5 Supermarket Type1          994.7052

The data set contains two types of variables:
Categorical variables: Item_Identifier, Item_Fat_Content, Item_Type, Outlet_Identifier, Outlet_Size, Outlet_Location_Type, Outlet_Type.
Numerical variables: Item_Weight, Item_Visibility, Item_MRP,
Outlet_Establishment_Year, Item_Outlet_Sales.

Let's look at the summary of the data.

##  Item_Identifier  Item_Weight     Item_Fat_Content Item_Visibility 
##  FDG33  :  10    Min.   : 4.555   LF     : 316     Min.   :0.0036  
##  FDW13  :  10    1st Qu.: 8.774   low fat: 112     1st Qu.:0.0314  
##  DRE49  :   9    Median :12.600   Low Fat:5089     Median :0.0578  
##  DRN47  :   9    Mean   :12.858   reg    : 117     Mean   :0.0705  
##  FDD38  :   9    3rd Qu.:16.850   Regular:2889     3rd Qu.:0.0981  
##  FDF52  :   9    Max.   :21.350                    Max.   :0.3284  
##  (Other):8467    NA's   :1463                      NA's   :526     
##                  Item_Type       Item_MRP      Outlet_Identifier
##  Fruits and Vegetables:1232   Min.   : 31.29   OUT027 : 935     
##  Snack Foods          :1200   1st Qu.: 93.83   OUT013 : 932     
##  Household            : 910   Median :143.01   OUT035 : 930     
##  Frozen Foods         : 856   Mean   :140.99   OUT046 : 930     
##  Dairy                : 682   3rd Qu.:185.64   OUT049 : 930     
##  Canned               : 649   Max.   :266.89   OUT045 : 929     
##  (Other)              :2994                    (Other):2937     
##  Outlet_Establishment_Year Outlet_Size   Outlet_Location_Type
##  Min.   :1985                    :   0   Tier 1:2388         
##  1st Qu.:1987              High  : 932   Tier 2:2785         
##  Median :1999              Medium:2793   Tier 3:3350         
##  Mean   :1998              Small :2388                       
##  3rd Qu.:2004              NA's  :2410                       
##  Max.   :2009                                                
##                                                              
##             Outlet_Type   Item_Outlet_Sales 
##  Grocery Store    :1083   Min.   :   33.29  
##  Supermarket Type1:5577   1st Qu.:  834.25  
##  Supermarket Type2: 928   Median : 1794.33  
##  Supermarket Type3: 935   Mean   : 2181.29  
##                           3rd Qu.: 3101.30  
##                           Max.   :13086.97  
## 

Univariate Plots

Examining the output of the summary function, we see that the results differ
for continuous and categorical variables.
Item_Fat_Content really has only two values, low fat and regular, but the data
set spreads them over five categories: Low Fat, low fat, LF, reg and Regular.
Let's fix this problem.

# Change "reg" to "Regular", and "LF" and "low fat" to "Low Fat"
fat <- bigmarts_train_imp$Item_Fat_Content
fat <- replace(fat, fat == "reg", "Regular")
fat <- replace(fat, fat == "LF", "Low Fat")
fat <- replace(fat, fat == "low fat", "Low Fat")
# Drop the three factor levels that are no longer used
bigmarts_train_imp$Item_Fat_Content <- droplevels(fat)

The density plot of Item_MRP shows several distinct price bands, so Item_MRP
is divided into four categories: Low, Medium, High and Very_High.
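
Binning a numeric price into four labelled levels can be done with `cut()`. The sketch below is only an illustration: the cut points in `mrp_breaks` are hypothetical, since the thresholds actually used are not stated here, and `item_mrp` is a small made-up sample.

```r
# A few example prices (invented values within the observed MRP range)
item_mrp <- c(31.29, 48.27, 93.83, 141.62, 182.10, 249.81, 266.89)

# Hypothetical cut points; the real thresholds would be read off the density plot
mrp_breaks <- c(0, 70, 135, 200, Inf)

# cut() assigns each price to the interval (lower, upper] it falls into
mrp_level <- cut(item_mrp,
                 breaks = mrp_breaks,
                 labels = c("Low", "Medium", "High", "Very_High"))
print(table(mrp_level))
```

The result is a factor, so it can be used directly as a grouping or color variable in later plots.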

We can see that the density plots of Item_Visibility before and after
imputation are almost identical.

Univariate Plots Section: Divided into two levels of analysis based on Product level and Outlet level

The long-tailed distribution of sales shows that a small number of items account
for most of the sales: approximately 20% of items generate 80% of sales.
Let us move forward and look at the distribution of sales across the categorical
variables.
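
A claim such as "20% of items generate 80% of sales" can be checked with a few lines of base R. This is a minimal sketch: the helper `top_share` and the vector `toy_sales` are assumptions for illustration, not part of the original analysis.

```r
# Share of total sales contributed by the top fraction of items
top_share <- function(sales, top = 0.2) {
  sales <- sort(sales, decreasing = TRUE)
  n_top <- ceiling(top * length(sales))  # number of items in the top group
  sum(sales[seq_len(n_top)]) / sum(sales)
}

# Toy example: two large sellers and eight small ones
toy_sales <- c(50, 30, rep(2.5, 8))
top_share(toy_sales)  # 0.8 for this toy vector
```

On the real data, the sales would first have to be totalled per item, for example with `tapply(Bigmarts$Item_Outlet_Sales, Bigmarts$Item_Identifier, sum)`, before being passed to the helper.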

It shows that OUT010 and OUT019 have relatively low sales.

Low-fat items have higher sales than regular-fat items.

It is apparent that medium-sized outlets have higher sales.

Tier 3 has more sales.

Supermarket Type1 and Type3 have high sales and are reasonably well
represented.
Grocery stores are more numerous but do not generate much in sales.

It is important to know how long each outlet had been established as of 2013,
since the data is for the year 2013.

bigmarts_train_imp$Total_Year_Establishment <- as.factor(2013-bigmarts_train_imp$Outlet_Establishment_Year)
bigmarts_train_imp$I_O_SaleP <- bigmarts_train_imp$Item_Outlet_Sales * 100 / max(bigmarts_train_imp$Item_Outlet_Sales)

Arrange the data so that Item_Outlet_Sales becomes the last column.

Bigmarts <- subset(bigmarts_train_imp, select = c(Item_Identifier,
                                                  Item_Weight,
                                                  Item_Fat_Content,
                                                  Item_Visibility,
                                                  Item_Type,
                                                  Item_MRP,
                                                  MRP_Level,
                                                  Outlet_Identifier,
                                                  Outlet_Size,
                                                  Outlet_Location_Type,
                                                  Outlet_Type,
                                                  Total_Year_Establishment,
                                                  Item_Outlet_Sales))

In Tier 3 locations, the older outlets have good sales.
In Tier 2 and Tier 3 locations, middle-aged outlets have good sales.

Univariate Analysis

What is the structure of your dataset?

Variable                   Description
Item_Identifier            Unique product ID
Item_Weight                Weight of product
Item_Fat_Content           Whether the product is low fat or not
Item_Visibility            % of total display area in store allocated to this product
Item_Type                  Category to which product belongs
Item_MRP                   Maximum Retail Price (list price) of product
MRP_Level                  Category of MRP: Low, Medium, High or Very_High
Outlet_Identifier          Unique store ID
Total_Year_Establishment   Number of years the store had been established as of 2013
Outlet_Size                Size of the store
Outlet_Location_Type       Type of city in which store is located
Outlet_Type                Grocery store or some sort of supermarket
Item_Outlet_Sales          Sales of product in particular store. This is the target variable to be predicted.

There are 8523 observations with 13 features.

Observations:

* There are 1559 unique items and 10 different outlets.
* The density plot of Item_MRP suggests four groups of prices: Low, Medium, High and Very_High.
* There are more low-fat products than regular products.
* Most of the outlets stock products in the High and Medium MRP levels.
* The most popular product types in the outlets are Baking Goods, Snack Foods, Household, Fruits and Vegetables, and Frozen Foods.
* Most of the outlets are of medium size.
* Most of the outlet locations are Tier 3.
* OUT035, OUT045, OUT046, OUT027, OUT017, OUT018 and OUT013 appear most often in the data.

What is/are the main feature(s) of interest in your dataset?

At this point in the analysis, the main features in the data set are the
Item_MRP level, Item_Weight, Item_Fat_Content and Item_Outlet_Sales.
I would like to determine which features, and which combinations of features,
best explain item sales in an outlet.
I hope to discover more significant features and combinations through
bivariate and multivariate analysis.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

The levels of the categorical variables are likely to contribute to the sales
of a product in an outlet, and the number of years an outlet has been established
could also play a role. The assumption is that the older a store is, the more
likely it is to have higher sales.

Did you create any new variables from existing variables in the dataset?

The following variables were created:
MRP_Level: The density plot of Item_MRP showed four different price groups,
so I created four levels for the price of the product: Low, Medium, High and
Very_High.
Total_Year_Establishment: The establishment year of the outlet subtracted
from 2013.

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

526 of the 8523 Item_Visibility entries are zero, which makes no sense for a
product stocked in an outlet, especially since products with zero visibility
still record sales.
This may be a data entry mistake, so I imputed all of the zero entries.

Bivariate Plots Section

Bivariate plots based on outlet type

In the above distribution for each outlet type, Supermarket Type3 has the
highest sales, at 28 years since establishment.
According to the color scheme, Supermarket Type1 has good sales at 9 and 14
years since establishment.
Grocery stores have lower product sales.

It is clear that most of the sales come from Tier 3 outlet locations.

Sales are higher in medium-sized outlets, as shown in the above distribution graph.

It is clear that OUT027 has much higher sales than the other outlets.

Bivariate plots based on Product type

Sales of Low Fat and Regular items are almost the same for the Medium outlet
size.

The above graph clearly depicts a linear relationship between Item_MRP
and Item_Outlet_Sales.
Let us gain more insight by adding a color scheme, further dimensions and
transformations.

## Warning: Removed 132 rows containing non-finite values (stat_smooth).
## Warning: Removed 132 rows containing missing values (geom_point).

I looked again at the categorical variables against Item_Outlet_Sales and Item_MRP.
Applying a log transform to Item_Outlet_Sales and a cube-root transform to
Item_MRP produces a more linear trend, so an almost linear relationship
appears between Item_MRP and Item_Outlet_Sales after transformation.
The graph of Item_MRP (log10) against Item_Outlet_Sales (log10), colored by
MRP_Level, also shows an almost perfectly linear relation between the two.
With the color scheme added, it is apparent that products in the High and
Very_High MRP levels have high sales values.
It is also important to understand that a high correlation does not necessarily
mean there is an underlying fundamental relationship between the two
variables; it can, however, offer some insight for further investigation.
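
The effect of the two transformations can be checked numerically by comparing the correlation before and after transforming. The sketch below uses synthetic data generated for illustration only (the coefficients 4 and 0.64 and the noise level are assumptions), so the numbers it prints say nothing about the real data set.

```r
set.seed(42)
n <- 200
item_mrp <- runif(n, 30, 270)

# Synthetic sales, built so that log(sales) is linear in the cube root of MRP
sales <- exp(4 + 0.64 * item_mrp^(1/3) + rnorm(n, sd = 0.3))

# Pearson correlation on the raw scale and on the transformed scale
cor_raw   <- cor(item_mrp, sales)
cor_trans <- cor(item_mrp^(1/3), log(sales))
print(c(raw = cor_raw, transformed = cor_trans))
```

On the real data, the same two `cor()` calls applied to `Bigmarts$Item_MRP` and `Bigmarts$Item_Outlet_Sales` would show how much the transformations straighten the relationship.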

The above graph shows no apparent relationship between Item_Weight and sales.

Items with visibility below 0.2 have higher sales. Let's take a closer look
by breaking this down by store type.

In Supermarket Type1, Type2 and Type3, items with visibility below 0.2 have more
sales than in grocery stores.

No clear relationship between Total_Year_Establishment and Item_Outlet_Sales
stands out from the vertical strips in the graph, although there may be some
kind of nonlinear relationship.
The graph does suggest, though not conclusively, that older outlets have
higher sales.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

Item_MRP correlates strongly with Item_Outlet_Sales.
The Very_High MRP level is also positively correlated with Item_Outlet_Sales.
Certain outlet identifiers are also positively correlated with sales.
There is no apparent relationship between Total_Year_Establishment and
Item_Outlet_Sales.
Item_Visibility and Item_Outlet_Sales are slightly negatively correlated.
To make this clearer, I added a color scheme to show which items are involved.
It is clear that Baking Goods, Breads, Breakfast items and Dairy products occupy
less space but still have more sales. It makes sense that items which take up
less display space in an outlet do not necessarily have lower sales.
It is therefore unfair to rely on the visibility percentage alone to explain sales.
Still, items with visibility below 0.2 were seen to have higher sales.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

What was the strongest relationship you found?

Item_Outlet_Sales is strongly correlated with Item_MRP.
Certain outlets are also positively correlated with sales.
These correlations can be computed by dummy-coding the categorical variables.

Multivariate Plots Section

  1. Fruits and Vegetables and Snack Foods have the most sales.
  2. Household and Frozen Foods come second.
  3. Breakfast, Seafood, Hard Drinks and Starchy Foods have the lowest sales.

The graph shows that low-fat items have high outlet sales.

Items tagged with the Low MRP level have very low sales.

It shows that OUT027 has much higher sales than the other outlets.

[Figure: Item Visibility vs Outlet Sales]

The above scatter plot shows a difference between grocery stores and
supermarkets: supermarkets have higher sales.
The graph also suggests that Item_Visibility and Item_Outlet_Sales are slightly negatively correlated.
To make this clearer, I added a color scheme to show which items are involved.
It is clear that Baking Goods, Breads, Breakfast items and Dairy products occupy
less space but still have more sales.
It also shows that items with visibility below 0.2 have high sales.

## [1] 8523 1618
##                       i                     j           cor         p
## 1 Item_Identifier.DRA12 Item_Identifier.DRA24 -0.0007609629 0.9440011
## 2 Item_Identifier.DRA12 Item_Identifier.DRA59 -0.0008135513 0.9401382
## 3 Item_Identifier.DRA24 Item_Identifier.DRA59 -0.0008787875 0.9353482
## 4 Item_Identifier.DRA12 Item_Identifier.DRB01 -0.0004980502 0.9633315
## 5 Item_Identifier.DRA24 Item_Identifier.DRB01 -0.0005379873 0.9603935
##                                     i                             j cor p
## 1223829      Item_Fat_Content.Low.Fat      Item_Fat_Content.Regular  -1 0
## 1290406      Outlet_Identifier.OUT018 Outlet_Type.Supermarket.Type2   1 0
## 1292014      Outlet_Identifier.OUT027 Outlet_Type.Supermarket.Type3   1 0
## 1293619      Outlet_Identifier.OUT018    Total_Year_Establishment.4   1 0
## 1293635 Outlet_Type.Supermarket.Type2    Total_Year_Establishment.4   1 0
## 1295226      Outlet_Identifier.OUT017    Total_Year_Establishment.6   1 0
## 1296839      Outlet_Identifier.OUT035    Total_Year_Establishment.9   1 0
## 1298450      Outlet_Identifier.OUT045   Total_Year_Establishment.11   1 0
## 1300063      Outlet_Identifier.OUT049   Total_Year_Establishment.14   1 0
## 1301666      Outlet_Identifier.OUT010   Total_Year_Establishment.15   1 0

Item_Visibility and Item_Weight also do not appear to be correlated.

Multivariate Analysis

The following shows the importance of the variables that contribute to sales.
I used the dummyVars function to translate all factor variables into numerical variables (these are also used for modeling). After applying dummyVars, the data has 1618
columns; the 10 features most strongly correlated with Item_Outlet_Sales are
shown.

# Dimensions of the newly created feature matrix
dim(BigmartsTrsf)
## [1] 8523 1618
# Top 10 features by correlation with Item_Outlet_Sales
selectedSub
##                                     i                 j        cor p
## 1308119                      Item_MRP Item_Outlet_Sales  0.5675744 0
## 1308141     Outlet_Type.Grocery.Store Item_Outlet_Sales -0.4117271 0
## 1308123           MRP_Level.Very_High Item_Outlet_Sales  0.3944153 0
## 1308121                 MRP_Level.Low Item_Outlet_Sales -0.3658668 0
## 1308129      Outlet_Identifier.OUT027 Item_Outlet_Sales  0.3111920 0
## 1308144 Outlet_Type.Supermarket.Type3 Item_Outlet_Sales  0.3111920 0
## 1308124      Outlet_Identifier.OUT010 Item_Outlet_Sales -0.2848826 0
## 1308150   Total_Year_Establishment.15 Item_Outlet_Sales -0.2848826 0
## 1308128      Outlet_Identifier.OUT019 Item_Outlet_Sales -0.2772498 0
## 1308122              MRP_Level.Medium Item_Outlet_Sales -0.2288469 0
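
The dummy coding and correlation ranking described above can be sketched in base R. Note that this uses `model.matrix()` as a stand-in for the dummyVars function, and the toy data frame and the names `X`, `cors` and `ranked` are assumptions made for illustration.

```r
# Toy data: one numeric feature, one factor and the target
toy <- data.frame(
  Item_MRP = c(50, 120, 200, 90, 240, 60),
  Outlet_Type = factor(c("Grocery Store", "Supermarket Type1", "Supermarket Type3",
                         "Grocery Store", "Supermarket Type3", "Supermarket Type1")),
  Item_Outlet_Sales = c(100, 1500, 4000, 150, 5200, 700)
)

# Removing the intercept (0 +) gives one 0/1 column per level of Outlet_Type
X <- model.matrix(~ 0 + Outlet_Type + Item_MRP, data = toy)

# Correlation of every column with the target, ranked by absolute value
cors   <- cor(X, toy$Item_Outlet_Sales)[, 1]
ranked <- cors[order(abs(cors), decreasing = TRUE)]
print(round(ranked, 3))
```

Ranking by absolute value keeps strongly negative features, such as the grocery store indicator, near the top alongside the strongly positive ones.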

Let us make this more apparent by using facet wrap in the following graph.

ggplot(Bigmarts, aes(Item_Visibility, Item_MRP)) +
  geom_point(aes(color = Item_Type)) +
  scale_x_continuous("Item Visibility", breaks = seq(0, 0.35, 0.05)) +
  scale_y_continuous("Item MRP", breaks = seq(0, 270, by = 30)) +
  theme_bw() +
  labs(title = "Scatterplot of Item_MRP and Item_Visibility with Item Type") +
  facet_wrap(~ Item_Type)

This shows that items with visibility below 0.2 have a high MRP, which is
interesting because we have already seen that Item_MRP is positively correlated
with sales and that items with visibility below 0.2 have high sales.

# scatterplotMatrix() is provided by the car package
library(car)
Bigmarts_corr <- subset(Bigmarts, select = c(Item_Weight, Item_Visibility, Item_MRP, Item_Outlet_Sales))
scatterplotMatrix(Bigmarts_corr[1:4], main = "Scatter plot of numerical variables")

The scatter plot above shows the correlation between the highly correlated pairs. The diagonal cells show the kernel density plot of each variable. The pairs plot, and in particular the last Item_Outlet_Sales column, tells us a lot about our
data set. On examination, there is a strong correlation between
Item_MRP and Item_Outlet_Sales. The Very_High MRP level is also positively correlated with Item_Outlet_Sales.
Certain outlet identifiers are also positively correlated with sales.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

Plotting Item_MRP with different facets revealed a valuable relationship with
Item_Outlet_Sales.
I was able to find the important numerical and categorical features that play a
significant role in sales.
The most correlated features are given below.

##  [1] "Item_MRP"                      "Outlet_Type.Grocery.Store"    
##  [3] "MRP_Level.Very_High"           "MRP_Level.Low"                
##  [5] "Outlet_Identifier.OUT027"      "Outlet_Type.Supermarket.Type3"
##  [7] "Outlet_Identifier.OUT010"      "Total_Year_Establishment.15"  
##  [9] "Outlet_Identifier.OUT019"      "MRP_Level.Medium"

Were there any interesting or surprising interactions between features?

I observed no correlation between Total_Year_Establishment and
Item_Outlet_Sales, even after transforming the variables.
This is strange, because some of the older outlets do have reasonable sales.

I also found it interesting that items with visibility below 0.2 have a high
MRP, because we have already seen that Item_MRP is positively
correlated with sales and that items with visibility below 0.2 have high sales.

OPTIONAL: Did you create any models with your dataset? Discuss the
strengths and limitations of your model.

## 
## Calls:
## m1: lm(formula = I(log(Item_Outlet_Sales)) ~ I(Item_MRP^(1/3)), data = Bigmarts)
## m2: lm(formula = I(log(Item_Outlet_Sales)) ~ I(Item_MRP^(1/3)) + 
##     Total_Year_Establishment, data = Bigmarts)
## m3: lm(formula = I(log(Item_Outlet_Sales)) ~ I(Item_MRP^(1/3)) + 
##     Total_Year_Establishment + MRP_Level, data = Bigmarts)
## m4: lm(formula = I(log(Item_Outlet_Sales)) ~ I(Item_MRP^(1/3)) + 
##     Total_Year_Establishment + MRP_Level + Outlet_Identifier, 
##     data = Bigmarts)
## m5: lm(formula = I(log(Item_Outlet_Sales)) ~ I(Item_MRP^(1/3)) + 
##     Total_Year_Establishment + MRP_Level + Outlet_Identifier + 
##     Outlet_Location_Type, data = Bigmarts)
## 
## ===========================================================================================
##                                        m1         m2         m3         m4         m5      
## -------------------------------------------------------------------------------------------
##   (Intercept)                        4.039***   4.079***   3.966***   4.077***   4.077***  
##                                     (0.058)    (0.053)    (0.204)    (0.148)    (0.148)    
##   I(Item_MRP^(1/3))                  0.642***   0.640***   0.664***   0.644***   0.644***  
##                                     (0.011)    (0.009)    (0.036)    (0.026)    (0.026)    
##   Total_Year_Establishment: 6                   0.205***   0.205***   0.205***   0.205***  
##                                                (0.033)    (0.033)    (0.024)    (0.024)    
##   Total_Year_Establishment: 9                   0.226***   0.224***   0.224***   0.224***  
##                                                (0.033)    (0.033)    (0.024)    (0.024)    
##   Total_Year_Establishment: 11                  0.131***   0.131***   0.131***   0.131***  
##                                                (0.033)    (0.033)    (0.024)    (0.024)    
##   Total_Year_Establishment: 14                  0.207***   0.206***   0.206***   0.206***  
##                                                (0.033)    (0.033)    (0.024)    (0.024)    
##   Total_Year_Establishment: 15                 -1.788***  -1.788***  -1.788***  -1.788***  
##                                                (0.038)    (0.038)    (0.028)    (0.028)    
##   Total_Year_Establishment: 16                  0.166***   0.166***   0.166***   0.166***  
##                                                (0.033)    (0.033)    (0.024)    (0.024)    
##   Total_Year_Establishment: 26                  0.146***   0.146***   0.145***   0.145***  
##                                                (0.033)    (0.033)    (0.024)    (0.024)    
##   Total_Year_Establishment: 28                 -0.183***  -0.183***   0.708***   0.708***  
##                                                (0.030)    (0.030)    (0.024)    (0.024)    
##   MRP_Level: Low                                          -0.033     -0.075     -0.075     
##                                                           (0.073)    (0.053)    (0.053)    
##   MRP_Level: Medium                                        0.040      0.024      0.024     
##                                                           (0.036)    (0.026)    (0.026)    
##   MRP_Level: Very_High                                    -0.106**   -0.091***  -0.091***  
##                                                           (0.033)    (0.024)    (0.024)    
##   Outlet_Identifier: OUT019/OUT010                                   -2.470***  -2.470***  
##                                                                      (0.028)    (0.028)    
## -------------------------------------------------------------------------------------------
##   R-squared                              0.3        0.5        0.5        0.7        0.7   
##   adj. R-squared                         0.3        0.5        0.5        0.7        0.7   
##   sigma                                  0.9        0.7        0.7        0.5        0.5   
##   F                                   3282.0      966.1      730.0     1870.8     1870.8   
##   p                                      0.0        0.0        0.0        0.0        0.0   
##   Log-likelihood                    -10849.2    -9238.6    -9221.6    -6483.9    -6483.9   
##   Deviance                            6364.6     4361.5     4344.1     2285.1     2285.1   
##   AIC                                21704.4    18499.2    18471.1    12997.8    12997.8   
##   BIC                                21725.6    18576.7    18569.8    13103.5    13103.5   
##   N                                   8523       8523       8523       8523       8523     
## ===========================================================================================
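
The way such a sequence of nested models is built and compared can be sketched with `lm()`, `update()` and `AIC()`. The example below is an illustration on synthetic data: the generating coefficients and the two-level Outlet_Type factor are assumptions, and only two models are fitted rather than five.

```r
set.seed(7)
n <- 300
outlet_type <- factor(sample(c("Grocery Store", "Supermarket Type1"), n, replace = TRUE))
item_mrp    <- runif(n, 30, 270)

# Synthetic log-sales: linear in the cube root of MRP, lower for grocery stores
log_sales <- 4 + 0.64 * item_mrp^(1/3) -
  1.8 * (outlet_type == "Grocery Store") + rnorm(n, sd = 0.5)
toy <- data.frame(Item_MRP = item_mrp,
                  Outlet_Type = outlet_type,
                  Item_Outlet_Sales = exp(log_sales))

# m1: transformed MRP only; m2: the same model plus the outlet type
m1 <- lm(log(Item_Outlet_Sales) ~ I(Item_MRP^(1/3)), data = toy)
m2 <- update(m1, . ~ . + Outlet_Type)

# Side-by-side comparison: higher adjusted R-squared and lower AIC are better
comparison <- data.frame(
  model  = c("m1", "m2"),
  adj_r2 = c(summary(m1)$adj.r.squared, summary(m2)$adj.r.squared),
  AIC    = c(AIC(m1), AIC(m2))
)
print(comparison)
```

Adding one term at a time with `update()` makes it easy to see which variable is responsible for each improvement in fit.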

Final Plots and Summary

Plot One

Description One

The distribution of item sales is long-tailed, which reflects a real-life
phenomenon. The long tail shows that a small number of items account for most of
the sales: approximately 20% of items generate 80% of sales.

Plot Two

Description Two

The graph of log10-transformed sales against the cube root of MRP shows a
linear trend, so it should be possible to build a linear predictive model.

Plot Three

Plot the highly correlated pairs using the {psych} package's pairs.panels function.

Description Three

The scatter plot shows the correlation between the highly correlated pairs.
The diagonal cells show the kernel density plot of each variable.
The pairs plot, and in particular the last Item_Outlet_Sales column, tells us a
lot about our data set.
On examination, there is a strong correlation between
Item_MRP and Item_Outlet_Sales. The Very_High MRP level is also positively correlated with Item_Outlet_Sales.
Certain outlet identifiers are also positively correlated with sales. The top 10 most correlated features are those listed in the Multivariate Analysis section.
It would also be possible to look at the correlation between categorical
variables.

Reflection

The Bigmarts data contains information about 1559 items and 10 outlets and
their sales for the year 2013.
It has 8523 observations of 12 variables. The data set contains two
types of variables:
Categorical variables: Item_Identifier, Item_Fat_Content, Item_Type, Outlet_Identifier, Outlet_Size,
Outlet_Location_Type, Outlet_Type.
Numerical variables: Item_Weight, Item_Visibility, Item_MRP, Outlet_Establishment_Year,
Item_Outlet_Sales.

I started by understanding the individual variables in the data set and plotting
them to find interesting patterns or trends.
The categorical variables held a lot of interesting information.
I explored Item_Outlet_Sales across many variables and found
interesting relations between them. It was clear that there is a positive
correlation between Item_MRP and Item_Outlet_Sales.
One thing that surprised me a little was that the density plots of
Item_Outlet_Sales and Item_Visibility are almost the same, which would suggest
a strong positive correlation, yet no relationship stands out between the two.

It seems really interesting to combine two categorical variables to predict
sales (e.g. Item + Outlet -> Sales).
I built a linear model between Item_MRP and sales and added some
categorical variables, but I believe it could be improved a lot in terms of
accuracy and performance. I plan to build a more accurate predictive
model in the future, once I am better versed in machine learning with R, as this
was the first time I worked with R. I was amazed by plotting graphs with R.